Sains Malaysiana 52(12)(2023): 3879-3892

http://doi.org/10.17576/jsm-2023-5212-19

 

An Efficient Method of Identification of Influential Observations in Multiple Linear Regression and Its Application to Real Data

 (Kaedah yang Cekap bagi Pengecaman Cerapan Berpengaruh dalam Model Regresi Linear Berganda dan Kegunaannya dalam Set Data Sebenar)

 

HABSHAH MIDI1,*, HASAN TALIB HENDI1, HASSAN URAIBI2, JAYANTHI ARASAN3 & SHELAN SAIED ISMAEEL4

 

1Institute for Mathematical Research, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia

2Department of Statistics, University of Al-Qadisiyah, Iraq

3Department of Mathematics & Statistics, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia

4Department of Mathematics, Faculty of Science, University of Zakho, Iraq

 

Received: 20 June 2023/Accepted: 14 November 2023

 

Abstract

Influential observations (IOs) are observations that, either alone or together with several other observations, have a detrimental effect on the computed values of various estimates. It is therefore very important to detect their presence. Several methods have been proposed to identify IOs, including the fast improvised influential distance (FIID). The FIID method has been shown to be more efficient than some existing methods. Nonetheless, the FIID method has several shortcomings: it is computationally unstable, still suffers from masking and swamping effects, is time consuming, and does not use a proper cut-off point. To address these problems, a new robust version of the influential distance method (RFIID), based on the Reweighted Fast Consistent and High Breakdown (RFCH) estimator, is proposed. The results of real data and a Monte Carlo simulation study indicate that the RFIID is able to correctly separate the IOs from the rest of the data with the shortest computational running time, the least swamping effect, and no masking effect compared with the other methods in this study.

 

Keywords: Good leverage point; influential distance; influential observations; Reweighted Fast Consistent and High Breakdown (RFCH) estimator

 

Abstrak

Cerapan berpengaruh (IO) ditakrifkan sebagai cerapan sama ada bersendirian atau bersama dengan beberapa cerapan lain yang mempunyai kesan memudaratkan ke atas nilai kiraan pelbagai anggaran. Oleh itu, sangat penting untuk mengecam kehadiran cerapan berpengaruh. Beberapa kaedah telah dicadangkan untuk mengecam IO termasuk kaedah penambahbaikan jarak berpengaruh pantas (FIID). Kaedah FIID telah ditunjukkan lebih cekap dibandingkan dengan kaedah sedia ada. Walau bagaimanapun, kaedah FIID mempunyai kelemahan iaitu pengiraannya tidak stabil, masih mempunyai kesan penyorokan dan limpahan, isu masa pengiraan yang panjang dan tidak menggunakan titik genting yang betul. Kaedah teguh versi baharu bagi jarak berpengaruh yang berasaskan penganggar berpemberat konsisten pantas dan titik musnah tinggi (RFIID) dicadangkan untuk mengatasi masalah ini. Keputusan data sebenar dan kajian simulasi Monte Carlo menunjukkan RFIID berupaya untuk mengasingkan IO daripada keseluruhan data dengan masa pengiraan paling singkat, kesan limpahan paling kecil tanpa kesan penyorokan dibandingkan dengan kaedah lain dalam kajian ini.

 

Kata kunci: Cerapan berpengaruh; jarak berpengaruh; penganggar pantas tekal berpemberat dan titik musnah tinggi; titik tuasan baik

 

REFERENCES

Belsley, D., Kuh, E. & Welsch, R. 2004. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Hoboken, New Jersey: John Wiley & Sons, Inc.

Chatterjee, S. & Hadi, A.S. 1986. Influential observations, high leverage points, and outliers in linear regression. Statistical Science 1(3): 379-393.

Devlin, S.J., Gnanadesikan, R. & Kettenring, J.R. 1981. Robust estimation of dispersion matrices and principal components. Journal of the American Statistical Association 76(374): 354-362.

Gunst, R.F. & Mason, R.L. 1980. Regression Analysis and Its Application: A Data Oriented Approach. New York: Marcel Dekker.

Habshah, M. & Shabbak, A. 2011. Robust multivariate control charts to detect small shifts in mean. Mathematical Problems in Engineering 2011: 923463. doi: 10.1155/2011/923463

Habshah, M., Muhammad, S. & Ismaeel, S.S. 2021. Fast improvised influential distance for the identification of influential observations in multiple linear regression. Sains Malaysiana 50(7): 2085-2094.

Habshah, M., Talib, H., Jayanthi, A. & Uraibi, H.S. 2020. Fast and robust diagnostic technique for the detection of high leverage points. Journal of Science and Technology 28(4): 1203-1220.

Habshah, M., Norazan, M.R. & Rahmatullah Imon, A.H.M. 2009. The performance of diagnostic-robust generalized potentials for the identification of multiple high leverage points in linear regression. Journal of Applied Statistics 36(5): 507-520.

Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. & Stahel, W.A. 2011. Robust Statistics: The Approach Based on Influence Functions. Hoboken, New Jersey: John Wiley & Sons, Inc.

Mohammed, A., Habshah, M. & Rahmatullah Imon, A.H.M. 2015. A new robust diagnostic plot for classifying good and bad high leverage points in a multiple linear regression model. Mathematical Problems in Engineering 2015: 279472. doi: 10.1155/2015/279472

Nurunnabi, A.A.M., Nasser, M. & Imon, A.H.M.R. 2016. Identification and classification of multiple outliers, high leverage points and influential observations in linear regression. Journal of Applied Statistics 43(3): 509-525.

Olive, D.J. & Hawkins, D.M. 2010. Robust Multivariate Location and Dispersion. Preprint, www.math.siu.edu/olive/preprints.htm

Olive, D.J. & Hawkins, D.M. 2008. High Breakdown Multivariate Estimators. https://www.researchgate.net/profile/David_Olive2/publication/240737720_High_Breakdown_Multivariate_Estimators/links/0a85e53234b7db7f90000000.pdf

Rahmatullah Imon, A.H.M. 2005. Identifying multiple influential observations in linear regression. Journal of Applied Statistics 32: 929-946.

Rahmatullah Imon, A.H.M. 2002. Identifying multiple high leverage points in linear regression. Journal of Statistical Studies 3: 207-218.

Rashid, A.M., Midi, H., Dhnn, W. & Arasan, J. 2021a. An efficient estimation and classification methods for high dimensional data using robust iteratively reweighted SIMPLS algorithm based on Nu-Support vector regression. IEEE Access 9: 45955-45967.

Rashid, A.M., Midi, H., Dhnn, W. & Arasan, J. 2021b. Detection of outliers in high-dimensional data using Nu-Support vector regression. Journal of Applied Statistics 49(10): 2550-2569.

Rousseeuw, P. & Leroy, A.M. 1987. Robust Regression and Outlier Detection. New York: Wiley Series in Probability and Mathematical Statistics.

Rousseeuw, P. & Yohai, V. 1984. Robust regression by means of S-estimators. In Robust and Nonlinear Time Series Analysis. New York: Springer.

Welsch, R.E. 1980. Regression sensitivity analysis and bounded-influence estimation. In Evaluation of Econometric Models, edited by Kmenta, J. & Ramsey, J.B. Massachusetts: Academic Press. pp. 153-167.

Zahariah, S. & Midi, H. 2022. Minimum regularized covariance determinant and principal component analysis-based method for the identification of high leverage points in high dimensional sparse data. Journal of Applied Statistics 50(13): 2817-2835.

 

*Corresponding author; email: habshah@upm.edu.my